NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Fuzzy Integration of Data Lake Tables. In EDBT 2026.

https://doi.org/10.48786/edbt.2026.08

Khatiwada, Aamod; Shraga, Roee; Miller, Renée J. (January 2026, OpenProceedings.org)
EDBT (Ed.)
Data integration is an important step in any data science pipeline where the objective is to unify the information available in different datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms only consider the equi-join scenario where only tuples having the same value on joining columns are integrated. This, however, does not realistically represent many realistic scenarios where datasets come from diverse sources with inconsistent values (e.g., synonyms, abbreviations, etc.) and with limited metadata. So, joining just on equal values severely limits the ability of Full Disjunction to fully combine datasets. Thus, in this work, we propose an extension of Full Disjunction to also account for “fuzzy” matches among tuples. We present a novel data-driven approach to enable the joining of approximate or fuzzy matches within Full Disjunction. Experimentally, we show that fuzzy Full Disjunction does not add significant time overhead over a state-of-the-art Full Disjunction implementation and also that it enhances the accuracy of a downstream data quality task.
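
The contrast between equi-join and fuzzy matching described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it handles only two tables (where Full Disjunction behaves like a full outer join), and the character-similarity matcher built on Python's `difflib` is a stand-in assumption rather than the paper's data-driven approach. The table contents, the `similar` helper, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical fuzzy matcher: case-insensitive character similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def full_outer_join(left, right, key, match):
    """Full outer join of two lists of dicts on `key`; `match` compares key values."""
    # Union of all column names, in first-seen order.
    columns = list(dict.fromkeys(c for row in left + right for c in row))
    result, matched_right = [], set()
    for lrow in left:
        hit = False
        for i, rrow in enumerate(right):
            if match(lrow[key], rrow[key]):
                merged = {**rrow, **lrow}  # left value wins on shared columns
                result.append({c: merged.get(c) for c in columns})
                matched_right.add(i)
                hit = True
        if not hit:
            # Unmatched left tuple is preserved, padded with None.
            result.append({c: lrow.get(c) for c in columns})
    for i, rrow in enumerate(right):
        if i not in matched_right:
            # Unmatched right tuple is preserved, padded with None.
            result.append({c: rrow.get(c) for c in columns})
    return result


# Toy tables with an inconsistent spelling of the same name (illustrative data only).
staff = [{"name": "Ann Lee", "dept": "Sales"}, {"name": "Bob Roy", "dept": "Legal"}]
rooms = [{"name": "ann lee", "office": "B12"}, {"name": "Cy Diaz", "office": "C3"}]

exact = full_outer_join(staff, rooms, "name", lambda a, b: a == b)
fuzzy = full_outer_join(staff, rooms, "name", similar)

print(len(exact), len(fuzzy))
```

With equality matching, "Ann Lee" and "ann lee" stay in separate, partially empty tuples; with the fuzzy matcher they are combined into a single tuple that carries both `dept` and `office`.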
Diverse Unionable Tuple Search: Novelty-Driven Discovery in Data Lakes. In EDBT 2026.

https://doi.org/10.48786/edbt.2026.04

Khatiwada, Aamod; Shraga, Roee; Miller, Renée J. (January 2026, OpenProceedings.org)
EDBT (Ed.)
Unionable table search techniques input a query table from a user and search for data lake tables that can contribute additional rows to the query table. The definition of unionability is generally based on similarity measures which may include similarity between columns (e.g., value overlap or semantic similarity of the values in the columns) or tables (e.g., similarity of table embeddings). Due to this and the large redundancy in many data lakes (which can contain many copies and versions of the same table), the most unionable tables may be identical or nearly identical to the query table and may contain little new information. Hence, we introduce the problem of identifying unionable tuples from a data lake that are diverse with respect to the tuples already present in a query table. We perform an extensive experimental analysis of well-known diversity algorithms applied to this novel problem and identify a gap that we address with a novel, clustering-based tuple diversity algorithm called DUST. DUST uses a novel embedding model to represent unionable tuples that outperforms other tuple representation models by at least 15% when representing unionable tuples. Using real data lake benchmarks, we show that our diversification algorithm is more than six times faster than the most efficient diversification baseline. We also show that it is more effective in diversifying unionable tuples than existing diversification algorithms.
A Generative Benchmark Creation Framework for Detecting Common Data Table Versions

https://doi.org/10.1145/3627673.3679157

Fox, Daniel C.; Khatiwada, Aamod; Shraga, Roee (October 2024, ACM)

Full Text Available
DIALITE: Discover, Align and Integrate Open Data Tables

https://doi.org/10.1145/3555041.3589732

Khatiwada, Aamod; Shraga, Roee; Miller, Renée J. (June 2023, ACM SIGMOD)

Full Text Available
SANTOS: Relationship-based Semantic Table Union Search

https://doi.org/10.1145/3588689

Khatiwada, Aamod; Fan, Grace; Shraga, Roee; Chen, Zixuan; Gatterbauer, Wolfgang; Miller, Renée J.; Riedewald, Mirek (May 2023, Proceedings of the ACM on Management of Data)

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
Full Text Available
Integrating Data Lake Tables

https://doi.org/10.14778/3574245.3574274

Khatiwada, Aamod; Shraga, Roee; Gatterbauer, Wolfgang; Miller, Renée J. (December 2022, Proceedings of the VLDB Endowment)

We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980's, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.
Full Text Available